Tips for Drafting R Markdown Documents
Introduction
When presenting a data overview or exploratory analysis results, we used to copy lots of tables and charts from RStudio into PowerPoint, which made preparing a presentation painful. It has become essential for data scientists to use better reporting tools, such as R Markdown or Jupyter notebooks, to author analysis presentations in a more efficient and organized way. Of course, we also want the results to be reproducible!
In this post, I would like to share some tips I picked up while building analysis reports with R Markdown and R notebooks.
Tables
Native markdown tables aren’t very user-friendly, so we have to use functions such as knitr::kable or DT::datatable to render a table from a data.frame.
Here are some tips on choosing between kable and datatable.
- kable has a simpler syntax and gives more appealing “table-like” tables in most themes.
- datatable has more capabilities, such as paged tables with download buttons. Further configuration options are described in its JavaScript API specification.
In a nutshell, kable is preferable for smaller tables, while datatable is preferable for bigger tables.
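The rule of thumb above could be wrapped in a small helper. This is only a sketch: the function name render_table and the 20-row cutoff are illustrative assumptions, not part of knitr or DT.

```r
library(knitr)
library(DT)

# Hypothetical helper: the name and the default cutoff are assumptions
# made for illustration, not an API of knitr or DT.
render_table <- function(df, max_static_rows = 20) {
  if (nrow(df) <= max_static_rows) {
    # small table: static kable output
    kable(df, digits = 1)
  } else {
    # bigger table: paged, searchable datatable
    datatable(df, options = list(pageLength = 10))
  }
}

small_tbl <- render_table(head(mtcars))  # 6 rows, rendered with kable
large_tbl <- render_table(mtcars)        # 32 rows, rendered with datatable
```

The cutoff is worth tuning per report; a table that fits on one screen usually reads better as a static kable.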
markdown table
some random markdown table
| Tables | Are | Cool |
|----------|:-------------:|------:|
| col 1 is | left-aligned | $1600 |
| col 2 is | centered | $12 |
| col 3 is | right-aligned | $1 |
| Tables | Are | Cool |
|---|---|---|
| col 1 is | left-aligned | $1600 |
| col 2 is | centered | $12 |
| col 3 is | right-aligned | $1 |
kable
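kable is from the knitr package; the styling functions used below come from kableExtra.
library(dplyr)       # provides the %>% pipe
library(knitr)
library(kableExtra)
# render the first rows of mtcars with a white-on-black header row
mtcars %>%
  head() %>%
  kable(digits = 1, caption = 'example of kable table') %>%
  kable_styling(full_width = FALSE, position = 'left') %>%
  row_spec(0,
           bold = TRUE,
           color = 'white',
           background = 'black')

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |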
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.6 | 16.5 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.9 | 17.0 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.9 | 2.3 | 18.6 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.1 | 3.2 | 19.4 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.1 | 3.4 | 17.0 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.8 | 3.5 | 20.2 | 1 | 0 | 3 | 1 |
datatable
datatable is from the DT package.
JS - DataTables
options list: https://datatables.net/reference/option/
R - DT package
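A minimal sketch of a paged datatable with download buttons, assuming the DT package is installed. The Buttons extension, the dom string and the button names follow the DataTables options reference linked above; the page length of 10 is an arbitrary choice.

```r
library(DT)

# Paged table with copy/CSV/Excel buttons.
# 'B' in the dom string places the buttons above the table.
dt_tbl <- datatable(
  mtcars,
  extensions = 'Buttons',
  options = list(
    pageLength = 10,                      # rows per page (arbitrary)
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  )
)
dt_tbl
```

In an HTML R Markdown document the last line prints the widget directly into the page.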
Data Summary
Data quality usually needs to be checked before any real analytics work. Since this is a routine job, there is a package called summarytools that handles it well. The following data summary report is generated for the example dataset mtcars using summarytools.
summarytools GitHub page: https://github.com/dcomtois/summarytools
library(summarytools)
mtcars %>%
  dfSummary(style = 'grid',
            graph.magnif = 0.75,
            plain.ascii = FALSE,
            valid.col = FALSE,
            tmp.img.dir = "/tmp") %>%
  print()

Data Frame Summary
mtcars
Dimensions: 32 x 11
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | mpg [numeric] | Mean (sd) : 20.1 (6); min < med < max: 10.4 < 19.2 < 33.9; IQR (CV) : 7.4 (0.3) | 25 distinct values | | 0 (0%) |
| 2 | cyl [numeric] | Mean (sd) : 6.2 (1.8); min < med < max: 4 < 6 < 8; IQR (CV) : 4 (0.3) | 4 : 11 (34.4%); 6 : 7 (21.9%); 8 : 14 (43.8%) | | 0 (0%) |
| 3 | disp [numeric] | Mean (sd) : 230.7 (123.9); min < med < max: 71.1 < 196.3 < 472; IQR (CV) : 205.2 (0.5) | 27 distinct values | | 0 (0%) |
| 4 | hp [numeric] | Mean (sd) : 146.7 (68.6); min < med < max: 52 < 123 < 335; IQR (CV) : 83.5 (0.5) | 22 distinct values | | 0 (0%) |
| 5 | drat [numeric] | Mean (sd) : 3.6 (0.5); min < med < max: 2.8 < 3.7 < 4.9; IQR (CV) : 0.8 (0.1) | 22 distinct values | | 0 (0%) |
| 6 | wt [numeric] | Mean (sd) : 3.2 (1); min < med < max: 1.5 < 3.3 < 5.4; IQR (CV) : 1 (0.3) | 29 distinct values | | 0 (0%) |
| 7 | qsec [numeric] | Mean (sd) : 17.8 (1.8); min < med < max: 14.5 < 17.7 < 22.9; IQR (CV) : 2 (0.1) | 30 distinct values | | 0 (0%) |
| 8 | vs [numeric] | Min : 0; Mean : 0.4; Max : 1 | 0 : 18 (56.2%); 1 : 14 (43.8%) | | 0 (0%) |
| 9 | am [numeric] | Min : 0; Mean : 0.4; Max : 1 | 0 : 19 (59.4%); 1 : 13 (40.6%) | | 0 (0%) |
| 10 | gear [numeric] | Mean (sd) : 3.7 (0.7); min < med < max: 3 < 4 < 5; IQR (CV) : 1 (0.2) | 3 : 15 (46.9%); 4 : 12 (37.5%); 5 : 5 (15.6%) | | 0 (0%) |
| 11 | carb [numeric] | Mean (sd) : 2.8 (1.6); min < med < max: 1 < 2 < 8; IQR (CV) : 2 (0.6) | 1 : 7 (21.9%); 2 : 10 (31.2%); 3 : 3 ( 9.4%); 4 : 10 (31.2%); 6 : 1 ( 3.1%); 8 : 1 ( 3.1%) | | 0 (0%) |
Static Plots
ggplot2 is our best friend for visualization in R, and it is well supported in R Markdown. Chaining functions with %>% and + keeps the code chunks tidy!
Often we would like to combine several sub-plots into one. ggplot2::facet_grid can do part of the job, but I found ggpubr::ggarrange more powerful: it lets you combine arbitrary plots and even tables. It’s cool to put a chart and a table side by side (an example is given in a subsequent section).
ggridges is another useful ggplot2 extension that draws multiple density plots in a single chart. It is often used to compare profiles between groups. See the details here: https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html
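As a minimal sketch of the ridge plot idea, assuming the ggridges package is installed, the following compares the mpg distribution across cylinder groups in mtcars. The choice of variables and the alpha value are illustrative only.

```r
library(ggplot2)
library(ggridges)

# One density ridge of mpg per cylinder group
ridge_plt <- ggplot(mtcars, aes(x = mpg, y = factor(cyl), fill = factor(cyl))) +
  geom_density_ridges(alpha = 0.6) +
  theme_ridges() +
  labs(y = 'cyl', fill = 'cyl', title = 'example of ggridges in R markdown')
ridge_plt
```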
ggplot
library(dplyr)    # provides filter() and the %>% pipe
library(ggplot2)
# correlation heatmap of mtcars, keeping each variable pair only once
cor(mtcars) %>%
  as.data.frame() %>%
  tibble::rownames_to_column('var1') %>%
  tidyr::pivot_longer(-var1, names_to = 'var2', values_to = 'cor') %>%
  filter(var1 <= var2) %>%
  ggplot(aes(x = var1, y = var2, fill = cor, label = round(cor, 2))) +
  geom_tile() +
  geom_text() +
  scale_fill_gradient2() +
  labs(title = 'example of ggplot2 in R markdown')
ggpubr
combine multiple charts or tables
library(dplyr)
library(ggpubr)
library(forcats)
# add .groups = 'drop' to remove some warnings from `dplyr`
data <- mtcars %>%
  group_by(gear) %>%
  summarise(n = n(), .groups = 'drop') %>%
  mutate(gear = fct_rev(factor(gear)))
plt <- data %>%
  ggplot(aes(x = gear, y = n)) +
  geom_col(fill = 'lightblue') +
  coord_flip() +
  labs(title = 'example of combined table and plot using ggpubr')
tbl <- ggtexttable(data, rows = NULL)
# place the chart and the table side by side, chart twice as wide
ggarrange(plt, tbl, ncol = 2, nrow = 1, widths = c(2, 1))
Interactive Plots
This is where things become tricky. Interactive plots are only supported in the HTML output of R Markdown, and there is no dominant interactive visualization package in the R ecosystem.
- plotly provides comprehensive chart types, documentation and cross-language support. However, I personally don’t like its style, its syntax, or the toolbar in the upper right corner.
- echarts4r is an R interface to the ECharts JavaScript library, which was open sourced by Baidu. I have tested the latest version, 0.3.2, and it works well with R Markdown.
- googleVis is an R interface to the Google Charts JavaScript library, which was of course developed by Google. The package has a good collection of chart types, but it has some unexplained incompatibilities with both R Markdown and Shiny. I’ve found a workaround to integrate googleVis charts in R Markdown, but it’s not perfect.
Interactive plots can’t be shown in GitHub-rendered pages, so I just paste the code in the following sections. It should work fine in R Markdown documents rendered locally.
Plotly
Plotly R documentation site: https://plotly.com/r/
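A minimal sketch of a plotly scatter chart, assuming the plotly package is installed. The variables and the title are illustrative; config(displayModeBar = FALSE) hides the toolbar in the upper right corner.

```r
library(plotly)

# Interactive scatter plot of weight vs. mpg, coloured by cylinder count
plotly_plt <- plot_ly(mtcars,
                      x = ~wt, y = ~mpg, color = ~factor(cyl),
                      type = 'scatter', mode = 'markers') %>%
  layout(title = 'example of plotly in R markdown') %>%
  config(displayModeBar = FALSE)   # hide the mode bar
plotly_plt
```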
Echarts
- echarts4r site: https://echarts4r.john-coene.com/
- Echarts site: https://echarts.apache.org/en/index.html
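A minimal sketch of an echarts4r scatter chart, assuming the echarts4r package is installed. The variables and the title are illustrative only.

```r
library(echarts4r)

# Interactive scatter plot: e_charts() sets the x axis,
# e_scatter() adds the series, e_tooltip() enables hover tooltips
echarts_plt <- mtcars %>%
  e_charts(wt) %>%
  e_scatter(mpg) %>%
  e_tooltip() %>%
  e_title('example of echarts4r in R markdown')
echarts_plt
```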
Google Charts
- googleVis: https://github.com/mages/googleVis
- Google Charts: https://developers.google.com/chart
self_contained: false is required for googleVis charts to render in R Markdown; see the related GitHub issue for details.
output:
  html_document:
    self_contained: false
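# self_contained: false is required for googleVis charts to render in R markdown
# `results = 'asis'` is required in the chunk options of the R markdown code chunk
suppressPackageStartupMessages(library(googleVis))
# make plot() emit only the chart HTML instead of opening a full page
op <- options(gvis.plot.tag = "chart")
# `dino` is not defined in this post; it is assumed to be a data frame loaded beforehand
plot(gvisHistogram(dino))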